# Load all necessary packages here
library(tidyverse)
library(readxl)
library(tigris)
library(leaflet)
library(htmltools)
library(htmlwidgets)
library(scales)
library(plotly)
library(moderndive)
library(leaflet.extras)
# create a vector of all the states in the West region of the U.S.
West = c("WA", "OR", "ID", "MT", "WY", "CO", 
         "UT", "NV", "CA", "AK", "HI")

# create a vector of all the states in the Southwest region of the U.S.
Southwest = c("AZ", "NM", "TX", "OK")

# create a vector of all the states in the Southeast region of the U.S.
Southeast = c("AR", "LA", "MS", "AL", "GA", "FL", "SC", 
              "TN", "NC", "KY", "VA", "WV", "DC", "DE", "MD")

# create a vector of all the states in the Northeast region of the U.S.
Northeast = c("NJ", "CT", "RI", "PA", "NY", "MA", "NH", "VT", "ME")

# create a vector of all the states in the Midwest region of the U.S.
Midwest = c("ND", "SD", "NE", "KS", "MO", "IA", "MN", 
            "WI", "IL", "IN", "OH", "MI")


# make a dataframe for our first dataset (poverty and household income)
# filter for year 2010
# select variables of importance
poverty_income <- read_excel("state_series_1980-2014.xls") %>%
  filter(year == 2010) %>%
  select(fips, state_code, median_hhinc, percent_pov, total_pop)

# make a dataframe for our second dataset (dropout rates)
# select variables of importance
# to get ready to join: rename a column
# and rename other variables we are using
dropout_rates <- read_excel("sdr091a.xls") %>%
  select(FIPST, STATENAME, DRP912,
         TOTDHI, TOTDBL, TOTDWH, TOTDAS, TOTDAM) %>%
  rename(fips = FIPST, state_name = STATENAME, dropouts = DRP912,
         total_hispanic = TOTDHI, total_black = TOTDBL, total_white = TOTDWH,
         total_asian = TOTDAS, total_american = TOTDAM)


# join by FIPS state numeric code
# remove District of Columbia
# add a new variable called "region" to the dataframe based on what region the state is in
dropout_data <- inner_join(dropout_rates, poverty_income, by = "fips") %>%
  filter(!(state_name == "District of Columbia")) %>%
  mutate(region = case_when(state_code %in% West ~ "West",
                            state_code %in% Southwest ~ "Southwest",
                            state_code %in% Southeast ~ "Southeast", 
                            state_code %in% Northeast ~ "Northeast",
                            state_code %in% Midwest ~ "Midwest"))

Introduction

In order to do this comparison and analysis, we used the state dropout data1 from the Common Core of Data (CCD)2 surveys that are submitted every year to the National Center for Education Statistics (NCES). Additionally, we used demographic and economic data3. This data came from the U.S. Bureau of Labor Statistics Local Area Statistics Project, U.S. Census Bureau Small Area Income and Poverty Estimates, and U.S. Census Bureau Population and Housing Estimates. We chose to focus on state, region, race, median household income, poverty level, and total resident population (all people who are usually residents of a specific state4).

The interactive map provides a closer view on each state’s statistics, the barplot allows for analysis of dropouts by race, and the simple and multiple regression models give an insight on how poverty levels and income influence dropout rates.

Interactive Stacked Barplot

We created an interactive stacked barplot5 and chose one common way to divide the United States into five regions6. To easily compare the heights of the different colors between the bars, you can hover your mouse over a color, and a popup label with more information will appear. To show/hide a race, click on it in the legend7.

# group by region
# count: number of Hispanic drop outs, number of Black drop outs,
# number of White drop outs, number of Asian/Hawaiian Native/Pacific Islander drop outs,
# number of American Indian/Alaska Native drop outs, total population
barplotData <- dropout_data %>%
  group_by(region) %>% 
  summarize(sum_hispanic = sum(total_hispanic), sum_black = sum(total_black), 
            sum_white = sum(total_white), sum_asian = sum(total_asian), 
            sum_american = sum(total_american), sum_pop = sum(total_pop))


# make stacked barplot (a stack for each race)
plot_ly(data = barplotData, x = ~region, y = ~sum_hispanic, 
        type = 'bar', name = 'Hispanic',
        text = paste("Total Region Population:", comma(barplotData$sum_pop)),
        marker = list(color = 'rgb(0,0,128)')) %>%
  add_trace(y = ~sum_black, name = 'Black',
            marker = list(color = 'rgb(30,144,255)')) %>%
  add_trace(y = ~sum_white, name = 'White',
            marker = list(color = 'rgb(135,206,250)')) %>%
  add_trace(y = ~sum_asian, name = 'Asian/Hawaiian Native/Pacific Islander',
            marker = list(color = 'rgb(0,191,255)')) %>%
  add_trace(y = ~sum_american, name = 'American Indian/Alaska Native',
            marker = list(color = 'rgb(20, 106, 162)')) %>%
  layout(title ="Total High School Dropouts by Race in the 2009-2010 School Year",
         yaxis = list(title = 'Number of Dropouts', tickformat = ",d"), 
         xaxis = list(title = 'U.S. Region', categoryorder = "array",
                      categoryarray = c("West", "Southeast", "Midwest", 
                                        "Southwest", "Northeast")),
         barmode = 'stack',
         legend = list(x = 100, y = 0.5),
         annotations = list(yref = 'paper', xref = 'paper', y = 0.65, x = 1.13,
                            text = "Race", showarrow = F)) %>%
  config(displayModeBar = FALSE)

The descending order of the barplot8 allows us to easily see that the West has the largest number of dropouts and the Northeast has the fewest. However, this could be due to the fact that the West has a larger population than the Northeast. However, this reasoning doesn’t explain why the West has more dropouts than the Southeast or Midwest since both of these regions have larger populations than the West.

Among all races, Hispanics have the highest number of dropouts in the West and Southwest, while Whites have the highest number in the Southeast, Midwest, and Northeast. Again, this could be because of the concentration of these races in specific regions. If we consider proportions across states, American Indian students have the highest average dropout rate (5.75%), followed by Black students (5.55%). On the other hand, Asian students and White students have the lowest average rates at 1.97% and 2.72%, respectively.

Interactive Choropleth Map

We created an interactive choropleth map9 which is colored by 2009-2010 dropout rates. Clicking on a state will display some key statistics to consider when analyzing dropout rates like the state’s total population, poverty level, and median household income. We start our map centered on the U.S. at zoom level 4. The icon with four arrows10 brings you back to this setting.

# load spatial data
states <- states()

# inner join spatial data and a dataframe
states_merged <- geo_join(states, dropout_data, "STUSPS", "state_code", how = "inner")

# make blue color palette based on the range of dropout rate numbers
pal_dropouts <- colorNumeric("Blues", domain=states_merged$dropouts)

# make popup labels
popup_label <- paste0("<strong>", states_merged$NAME, 
                      "</strong><br />Total Population: ", 
                      comma(states_merged$total_pop),
                      "<br />Dropout Rate: ", 
                      paste(format(round(states_merged$dropouts, 2), nsmall = 2), "%", 
                            sep = ""),
                      "<br />Percent in Poverty: ", 
                      paste(states_merged$percent_pov, "%", sep = ""),
                      "<br />Median Household Income: ", 
                      comma(states_merged$median_hhinc))
# make interactive map
# at start: center the map on the U.S.
# add icon to reset map to zoom level 4, centered on U.S.
leaflet(states) %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(-98.483330, 38.712046, zoom = 4) %>% 
  addPolygons(data = states_merged, 
              fillColor = ~pal_dropouts(states_merged$dropouts), 
              fillOpacity = 0.7, 
              weight = 0.2, 
              smoothFactor = 0.2, 
              highlight = highlightOptions(weight = 5, color = "#666",
                                           fillOpacity = 0.7, bringToFront = TRUE),
              popup = ~popup_label,
              label = states_merged$NAME) %>%
  addLegend(pal = pal_dropouts, 
            values = states_merged$dropouts, 
            position = "bottomright", 
            title = "Dropout Rate",
            labFormat = labelFormat(suffix = "%")) %>%
  addResetMapButton()

In the map, Arizona has the highest dropout rate (7.8%) and New Hampshire has the lowest dropout rate (1.2%). We also see that certain states have a higher or close dropout rate to the country’s average (7.43%) like Mississippi (7.40%) and Arizona (7.8%). These states also happen to be next to well performing states; Alabama (1.80%) is next to Mississippi, and Utah (2.60%) is right above Arizona.

This map shows us that states with dropout rates below 2% (New Hampshire, Indiana, Idaho, and Minnesota) have less than 6 million people. Whereas states that have high dropout rates (Arizona, Mississippi, and New Mexico) do not necessarily have a big population, but have at least more than 17% of people living in poverty.

Regression

Interactive Simple Linear Regression

We created two simple linear regression models11 to see the individual relationships between each variable, income and poverty, and dropout rates. Hovering your mouse over a point in a subplot will display more information about that observation. Hovering over a line will display what region the linear regression corresponds to. To show/hide a region, click on it in the legend. The linear regression line for each region is shown in each subplot. We did this because we wanted to explore the relationships between the variables across regions.

# scatterplot of median income vs. total dropout rates
# add hover text
# set colors
# add regression lines
# format y-axis
incomeplot <- ggplot(dropout_data, 
                     aes(x = dropouts, y = median_hhinc, color = region)) +
  geom_point(mapping = aes(text = paste("State: ", 
                                        dropout_data$state_name, 
                                        "<br />Region: ", dropout_data$region, 
                                        "<br />Total Population: ", 
                                        comma(dropout_data$total_pop), 
                                        "<br />Dropout Rate: ", 
                                        paste(format(round(dropouts, 2), nsmall = 2)), "%", 
                                        "<br />Median Household Income: ", 
                                        comma(dropout_data$median_hhinc), sep = ""))) +
  scale_color_manual(values = c('#000080', '#1E90FF', '#87CEFA', '#00BFFF', '#146AA2')) +
  labs(x = "Dropout Rate", y = "Median Household Income", 
       color = "", size = "") +
  geom_smooth(method = "lm", se = FALSE, size = 0.5,
              mapping = aes(text = paste("Region:", dropout_data$region))) +
  scale_y_continuous(label = comma) +
  theme(text = element_text(size = 10))


# scatterplot of poverty vs. total dropout rates
# add hover text
# set colors
# add regression lines
# format y-axis
povertyplot <- ggplot(dropout_data, 
                      aes(x = dropouts, y = percent_pov, color = region)) +
  geom_point(mapping = aes(text = paste("State: ", 
                                        dropout_data$state_name, 
                                        "<br />Region: ", 
                                        dropout_data$region, 
                                        "<br />Total Population: ", 
                                        comma(dropout_data$total_pop), 
                                        "<br />Dropout Rate: ", 
                                        paste(format(round(dropouts, 2), nsmall = 2)), "%", 
                                        "<br />Percent in Poverty: ", 
                                        paste(dropout_data$percent_pov, "%", sep = ""),
                                        sep = ""))) +
  scale_color_manual(values = c('#000080', '#1E90FF', '#87CEFA', '#00BFFF', '#146AA2')) +
  labs(x = "Dropout Rate", y = "Percent of People Living in Poverty", 
       color = "", size = "") +
  geom_smooth(method = "lm", se = FALSE, size = 0.5, 
              mapping = aes(text = paste("Region:", dropout_data$region))) +
  theme(text = element_text(size = 10)) +
  scale_y_continuous(labels = function(x) paste0(x, "%"))


# turn ggplot into plotly plot
povertyplot <- ggplotly(povertyplot, height = 500, tooltip = "text")

# turn ggplot into plotly plot
# format legend
incomeplot <- ggplotly(incomeplot, height = 500, tooltip = "text") %>% 
  layout(legend = list(orientation = "h", x = 0.1, y = -0.13))


# combine the two simple regressions in one figure
# add titles for each subplot
# add legend title
subplot(povertyplot, style(incomeplot, showlegend = FALSE), 
        titleX = TRUE, titleY = TRUE, margin = 0.045) %>%
  layout(annotations = list(
    list(x = -0.001, y = 1.06,
         text = "Relationship between Dropout Rates and Poverty Level", 
         showarrow = F, xref = 'paper', yref = 'paper'),
    list(x = 0.958, y = 1.06,
         text = "Relationship between Dropout Rates and Income", 
         showarrow = F, xref = 'paper', yref = 'paper'),
    list(x = 0, y = -0.196, text = "U.S. Region", 
         showarrow = F, xref = 'paper', yref = 'paper'))) %>%
  config(displayModeBar = FALSE)

The model that analyzed the relationship between median household income and dropout rates is the following:

\[ \widehat{ Dropouts\ } = 4.564 -0.0000208 \cdot MedianIncome \]

The coefficient of the Income variable tells us that by every dollar increase in median income in each state, we would expect to see a decrease in dropouts rates by 0.0000208, on average. This coefficient was statistically significant (p-value: 0.021 \(<\) 0.05). The intercept tells us that if the median income was 0, we would expect to see a 4.564% dropout rate. This intercept was not statistically significant (p-value: 0.326 \(>\) 0.05).

We see a negative relationship between median income and dropout rates. We can see this trend in our graph on the right were most regions have a negative relation, except for the West.

The simple linear regression model on the relationship between percent of people in poverty and dropout rates is the following:

\[ \widehat{ Dropouts\ } = 1.05214 +0.16852 \cdot PercentPoverty \]

This coefficient tells us that by a one percent increase in the percent of people living in poverty in each state, we would expect the high school dropout rate to increase by 0.16852% on average. This coefficient did not have a statistically significant effect (p-value: 0.46489 \(>\) 0.0.5) on the dropout rates. The intercept tells us that if the percent of people living in poverty was zero, we would expect the dropout rate to be 1.05214%. The intercept was statistically significant (p-value: 0.00237 \(<\) 0.05).

The relationship between dropout rates and poverty level is positive, something we can see in our graph on the left for all regions but the West.

These two graphs give us a better understanding of why we observed states with a high poverty level and relatively high dropout rates. They also inform us that the relationship between income and dropout rates isn’t as significant and possibly explain why we didn’t see a pattern in our map before.

Even though we understand the relationship between these two variables and the dropout rates better, we wanted to explore what the relationship between them was when both income and poverty are used as predictors for dropout rates. We decided to used multiple regression to explore this idea.

multipleregressionmodel <- lm(dropouts~percent_pov+median_hhinc, dropout_data)
summary(multipleregressionmodel)
## 
## Call:
## lm(formula = dropouts ~ percent_pov + median_hhinc, data = dropout_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4518 -1.0091 -0.3034  0.7761  3.4140 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -8.546e+00  4.065e+00  -2.102  0.04090 * 
## percent_pov   4.274e-01  1.257e-01   3.400  0.00138 **
## median_hhinc  1.165e-04  4.778e-05   2.438  0.01862 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.436 on 47 degrees of freedom
## Multiple R-squared:  0.2064, Adjusted R-squared:  0.1726 
## F-statistic: 6.112 on 2 and 47 DF,  p-value: 0.004371
get_regression_table(multipleregressionmodel)
term estimate std_error statistic p_value lower_ci upper_ci
intercept -8.546 4.065 -2.102 0.041 -16.724 -0.369
percent_pov 0.427 0.126 3.400 0.001 0.175 0.680
median_hhinc 0.000 0.000 2.438 0.019 0.000 0.000
#plot(multipleregressionmodel)

Interactive Multiple Regression Plot

We created a multiple regression plot12 to model dropout rates as a function of poverty level and median household income. Hovering your mouse over an observation will display more precise information about its poverty level, dropout rate, and median household income. You can also hover over the regression plane to see its x-value (poverty level), y-value (median household income), and z-value (dropout rate). Clicking and dragging moves the axes around for different perspectives.

set.seed(56)

# Get coordinates of points for 3D scatterplot
x_values <- dropout_data$percent_pov
y_values <- dropout_data$median_hhinc
z_values <- dropout_data$dropouts

# Construct x and y grid elements
x_grid <- seq(from = min(x_values), to = max(x_values))
y_grid <- seq(from = min(y_values), to = max(y_values))


# Construct z grid by computing
# 1) fitted beta coefficients
# 2) fitted values of outer product of x_grid and y_grid
# 3) extracting z_grid (matrix needs to be of specific dimensions)
beta_hat <- dropout_data %>%
  lm(dropouts ~ percent_pov + median_hhinc, data = .) %>%
  coef()

fitted_values <- crossing(y_grid, x_grid) %>%
  mutate(z_grid = beta_hat[1] + beta_hat[2]*x_grid + beta_hat[3]*y_grid)

z_grid <- fitted_values %>%
  pull(z_grid) %>%
  matrix(nrow = length(x_grid)) %>%
  t()


# plot using plotly
# add 3D scatterplot
# add regression plane
# add title, format legend, add axis labels
plot_ly(data = dropout_data, height = 500) %>%
  add_markers(x = x_values, y = y_values, z = z_values, color = ~region,
              colors = c('#000080', '#1E90FF', '#87CEFA', '#00BFFF', '#146AA2'),
              marker = list(size = 5),
              hoverinfo = 'text',
              text = ~paste('Percent in Poverty:', 
                            paste(percent_pov, "%", sep = ""), 
                            '<br>Dropout Rate:', 
                            paste(format(round(dropouts, 2), nsmall = 2), "%", sep = ""), 
                            '<br>Median Household Income:',
                            comma(median_hhinc))) %>%
  add_surface(x = x_grid, y = y_grid, z = z_grid,
              colorscale = list(c(0,1), c("rgb(224, 249, 238)", "rgb(0, 255, 142)")),
              color = ~z_values, colorbar = list(title = 'Dropout Rate'),
              hoverinfo = 'x+y+z') %>%
  layout(margin = list(l = 0.02, r = 0.02, b = 0, t = 120), title = "3D Scatterplot and Regression Plane of <br>
         Dropout Rates, Poverty Level, and Median Household Income",
         legend = list(x = 1.01, y = 0.4),
         annotations = list(x = 1.148, y = 0.41, text = "U.S. Region", 
                            showarrow = F, xref = 'paper', yref = 'paper'),
         scene = list(
           zaxis = list(title = "Dropouts", ticksuffix = "%"),
           yaxis = list(title = "Income"),
           xaxis = list(title = "Poverty", ticksuffix = "%"))) %>%
  config(displayModeBar = FALSE)

The multiple regression model is the following: \[ \widehat{ Dropouts\ } = -8.546 + 0.4274 \cdot PercentPoverty + 0.0001165 \cdot MedianIncome \]

We found that in this multiple regression model, both the percent of people living in poverty and the median household income have a positive and statistically significant association with dropout rates. From this model we can see that a one percent increase in the percent of people living in poverty is associated with a 0.4274 increase in dropout rates, holding everything else constant. Additionally, a one dollar increase in median income is associated with a 0.0001165 increase in dropout rates, taking into consideration all other variables. Whenever the median income and the percent of poverty are zero, we would expect the dropout rate to be $-$8.546%.

Although this model gives us a better understanding of how these two variables are predicting dropout rates, it only explains 17% of the variability in dropout rates. This could be attributed to the many differences across states that are not accounted for in this model. Perhaps, other variables should be taken into consideration as predictors, like total school population and average attendance in high school, in order to create a better model.

Conclusion

Through this project we were able to analyze dropout rates more thoroughly. We saw how dropout rates varied across different states and races through our interactive map and bar graph. Additionally, we got a deeper understanding of how other factors like income and poverty influenced dropout rates by running simple and multiple regression models.

We learned that dropout rates have decreased in the past decades, but not enough to compete with other developed countries. Additionally, we learned that Hispanics and Whites make most of the high school dropout population across all races, not Blacks, as how it is often believed. We also learned that there is a lot of variability across states in dropout rates; some states have very low dropout rates while others have very high ones. Lastly, we were able to explore and understand how income and poverty were associated with dropout rates. In later research we would like to consider other variables that could explain some of the variability across states, and also compare the United States to other countries.

References and Citations


  1. “State Dropout and Completion Data.” National Center for Education Statistics (NCES), a Part of the U.S. Department of Education, National Center for Education Statistics, nces.ed.gov/ccd/drpcompstatelvl.asp.

  2. “Common Core of Data (CCD).” National Center for Education Statistics (NCES) Home Page, a Part of the U.S. Department of Education, National Center for Education Statistics, nces.ed.gov/ccd/.

  3. “Arts & Sciences, Public Policy.” William and Mary, www.wm.edu/as/publicpolicy/schroedercenter/for-faculty/Downloadable%20Health%20Datasets/State%20Level%20Downloadable%20Health%20Datasets/index.php.

  4. Daly, Michael. “Documentation: State Variable Longitudinal Dataset [1980 – 2014].” 12 Feb. 2016, www.wm.edu/as/publicpolicy/schroedercenter/for-faculty/Downloadable%20Health%20Datasets/State%20Level% 20Downloadable%20Health%20Datasets/Documentation%20State%20Variable%20Longitudinal%20Data%201980-2014.pdf.

  5. “Bar Charts.” Modern Visualization for the Data Era - Plotly, plot.ly/r/bar-charts/.

  6. National Geographic Society. “United States Regions.” National Geographic Society, 9 Nov. 2012, www.nationalgeographic.org/maps/united-states-regions/.

  7. “Legends.” Modern Visualization for the Data Era - Plotly, plot.ly/r/legend/.

  8. mtoto. “Ordering in r plotly barchart.” Stack Overflow, 20 Oct. 2016, stackoverflow.com/questions/40149556/ordering-in-r-plotly-barchart/40149815.

  9. Tran, Andrew Ba. “Interactive Choropleth Maps.” Interactive Choropleth Maps :: Journalism with R, learn.r-journalism.com/en/mapping/census_maps/census-maps/.

  10. Karambelkar, Bhaskar. “Leaflet.extras.” Function | R Documentation, www.rdocumentation.org/packages/leaflet.extras/versions/1.0.0/topics/addResetMapButton.

  11. “Subplots.” Modern Visualization for the Data Era - Plotly, plot.ly/r/subplots/.

  12. Kim, Albert. “Plotly R Code for Interactive 3D Scatterplot & Regression Plane of Seattle House Prices.” Gist, gist.github.com/rudeboybert/9905f44013c18d6add279cf13ab8e398.